Abstract. We introduce a new cryptographic tool: multiset hash functions. Unlike standard
hash functions which take strings as input, multiset hash functions operate on multisets (or 
sets). They map multisets of arbitrary finite size to strings (hashes) of fixed length. They are 
incremental in that, when new members are added to the multiset, the hash can be updated 
in time proportional to the change. The functions may be multiset-collision resistant in that 
it is difficult to find two multisets which produce the same hash, or just set-collision resistant 
in that it is difficult to find a set and a multiset which produce the same hash. 


In particular, we introduce four multiset hash functions, each with its own advantages. 
MSet-XOR-Hash uses the XOR operation and is very efficient; however, it uses a secret key 
and is only set-collision resistant. MSet-Add-Hash uses addition modulo a large integer and, 
thus, is slightly less efficient than MSet-XOR-Hash; MSet-Add-Hash also uses a secret key but it 
is multiset-collision resistant. MSet-Mu-Hash uses finite field arithmetic and is not as efficient 
as the other two hash functions; however, MSet-Mu-Hash is multiset-collision resistant, and 
unlike the other two hash functions, does not require a secret key. MSet-VAdd-Hash is more 
efficient than MSet-Mu-Hash; it is also multiset-collision resistant, and does not use a secret 
key, but the hashes it produces are significantly longer than the hashes of the other functions. 


The proven security of MSet-XOR-Hash and MSet-Add-Hash is quantitative. We reduce the 
hardness of finding collisions to the hardness of breaking the underlying pseudorandom 
functions. The proven security of MSet-Mu-Hash is in the random oracle model and is based 
on the hardness of the discrete logarithm problem. The proven security of MSet-VAdd-Hash 
is also in the random oracle model and is based on the hardness of the worst-case shortest 
vector problem. 


We demonstrate how set-collision resistant multiset hash functions make an existing offline 
memory integrity checker secure against active adversaries. We improve on this checker such 
that it can use smaller time stamps without increasing the frequency of checks. The improved 
checker uses multiset-collision resistant hash functions. 


Keywords: incremental cryptography, multiset hash functions, set-collision resistance, multiset- 
collision resistance, memory integrity checking 


1 Introduction 


Standard hash functions, such as SHA-1 [NIS95] and MD5 [Riv92], map strings of arbitrary 
finite length to strings (hashes) of a fixed length. They are collision-resistant in that it 
is difficult to find different input strings which produce the same hash. Incremental hash 
functions, described in [BGG94], have the additional property that, given changes to the 
input string, the computation to update the hashes is proportional to the amount of change
in the input string. For a small change, incremental hashes can be quickly updated, and 
do not need to be recalculated over the entire new input. 

Multiset hash functions are a novel cryptographic tool, for which the ordering of the 
inputs is not important. They map multisets of arbitrary finite size to hashes of fixed 
length. They are incremental in that, when new members are added to the multiset, 
the hash can be quickly updated. Because multiset hash functions work on multisets, we 
introduce definitions for multiset-collision resistance and set-collision resistance. 

We demonstrate how multiset hash functions enable secure offline integrity checkers 
for untrusted memory. Offline memory integrity checking was introduced by Blum et al. 
[BEG+91]. However, their implementation of offline checkers uses ε-biased hash functions
[NN90]; these hash functions can be used to detect random errors, but are not
cryptographically secure. We prove that multiset hash functions are cryptographically secure, and
show how they can be used in place of ε-biased hash functions to build an offline checker
secure against active adversaries. Furthermore, we introduce an improved checker that can 
perform better because it uses smaller time stamps without increasing the frequency of 
checks. Checking the integrity of memory is important in building secure processors which 
can facilitate software licensing and Digital Rights Management (DRM) [SCGT03]. 

The paper is organized as follows. Section 2 describes related work and summarizes 
our contributions. Multiset hash functions are defined in Section 3. MSet-XOR-Hash and 
MSet-Add-Hash are described in Section 4, and MSet-Mu-Hash and MSet-VAdd-Hash are 
described in Section 5. Our application of multiset hash functions to checking the integrity 
of memory is detailed in Section 6. Section 7 concludes the paper. Appendices A, B, C, 
and D prove the security of our multiset hash functions. Appendix E proves the security 
of our new memory integrity checker. 


2 Related Work and Our Contributions 


The main contribution of our work is the introduction of multiset hash functions together 
with the definition of multiset and set collision resistance. The second contribution is the 
development of a general theory leading to Theorem 1 from which we derive set-collision 
resistance for MSet-XOR-Hash, a multiset hash based on the XOR operation (addition 
modulo 2), and multiset-collision resistance for MSet-Add-Hash, a multiset hash based on 
addition modulo a large integer. The theory generalizes the results in [BGR95], where 
an XOR-based scheme is used for message authentication. Our theory holds for addition 
modulo any integer. 

Both MSet-XOR-Hash and MSet-Add-Hash use a secret key. The third contribution is 
Theorem 2 that proves multiset-collision resistance for MSet-Mu-Hash, a multiset hash 
function based on multiplication in a finite field; MSet-Mu-Hash does not use a secret key. 
The proof's basic line of thought is from [BM97], which develops message hashing based
on multiplication in a finite field. The fourth contribution, leading to MSet-VAdd-Hash, 
is Theorem 3 proving that we may replace multiplication in the finite field by vector 
addition modulo a large integer. In [BM97], a similar theorem is used for message hashing. 
Our theorem (and their theorem) follows directly from application of Ajtai’s theorem 
[GGH96, Ajt96].


Our final significant contribution is that we introduce an offline checker that is
cryptographically secure and which improves on the performance of the offline checker in
[BEG+91].


3 Multiset Hash Functions 


This section describes multiset hash functions. We first introduce multisets. We refer to a
multiset as a finite unordered collection of elements in which an element can occur as a member
more than once. All sets are multisets, but a multiset is not a set if an element appears
more than once. Let $M$ be a multiset of elements of a countable set $B$. The number of
times $b \in B$ is in the multiset $M$ is denoted by $M_b$ and is called the multiplicity of $b$ in
$M$. The sum of all the multiplicities of $M$ is called the cardinality of $M$. Multiset union
combines two multisets into a multiset in which elements appear with a multiplicity that
is the sum of their multiplicities in the initial multisets. We denote multiset union by $\cup$
and assume that the context in which $\cup$ is used makes clear to the reader whether we
mean set union or multiset union.
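
As an illustration of these definitions, the following minimal Python sketch models a multiset as a `collections.Counter` mapping each element to its multiplicity. The helper names `multiset_union`, `cardinality`, and `is_set` are hypothetical and chosen only for this example.

```python
from collections import Counter

def multiset_union(m1: Counter, m2: Counter) -> Counter:
    """Multiset union: the multiplicities of each element are added."""
    return m1 + m2

def cardinality(m: Counter) -> int:
    """Cardinality: the sum of all multiplicities."""
    return sum(m.values())

def is_set(m: Counter) -> bool:
    """A multiset is a set when no element occurs more than once."""
    return all(mult <= 1 for mult in m.values())

M = Counter({"a": 2, "b": 1})   # the multiset {a, a, b}
M2 = Counter({"a": 1, "c": 1})  # the set {a, c}
U = multiset_union(M, M2)       # the multiset {a, a, a, b, c}
```

Here `U["a"]` is the multiplicity of `a` in the union, which is the sum of its multiplicities in `M` and `M2`.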


Definition 1. Let $(\mathcal{H}, +_{\mathcal{H}}, \equiv_{\mathcal{H}})$ be a triple of probabilistic polynomial time (ppt)
algorithms. That triple is a multiset hash function if it satisfies:


compression: $\mathcal{H}$ maps multisets of $B$ into elements of a set with cardinality $\approx 2^m$, where
$m$ is some integer. Compression guarantees that we can store hashes in a small bounded
amount of memory.

comparability: Since $\mathcal{H}$ can be a probabilistic algorithm, a multiset need not always hash
to the same value. Therefore we need $\equiv_{\mathcal{H}}$ to compare hashes. The following relation
must hold for comparison to be possible:

$$\mathcal{H}(M) \equiv_{\mathcal{H}} \mathcal{H}(M)$$

for all multisets $M$ of $B$.

incrementality: We would like to be able to efficiently compute $\mathcal{H}(M \cup M')$ knowing
$\mathcal{H}(M)$ and $\mathcal{H}(M')$. The $+_{\mathcal{H}}$ operator makes that possible:

$$\mathcal{H}(M \cup M') \equiv_{\mathcal{H}} \mathcal{H}(M) +_{\mathcal{H}} \mathcal{H}(M')$$

for all multisets $M$ and $M'$ of $B$. In particular, knowing only $\mathcal{H}(M)$ and an element
$b \in B$, we can easily compute $\mathcal{H}(M \cup \{b\}) = \mathcal{H}(M) +_{\mathcal{H}} \mathcal{H}(\{b\})$.


As it is, this definition is not very useful, because $\mathcal{H}$ could be any constant function.
We need to add some kind of collision resistance to have a useful hash function. A collision
for $M'$ is a multiset $M \neq M'$ such that $\mathcal{H}(M) \equiv_{\mathcal{H}} \mathcal{H}(M')$.


Definition 2. A multiset hash function is (multi)set-collision resistant if it is
computationally infeasible to find a (multi)set $S$ of $B$ and a multiset $M$ of $B$ such that the
cardinalities of $S$ and $M$ are of polynomial size in $m$, $S \neq M$, and $\mathcal{H}(S) \equiv_{\mathcal{H}} \mathcal{H}(M)$.


Remark. (Not needed for understanding the remainder of this paper.) To make this notion
formal, we need to consider a family $F$ of multiset hash functions $(\mathcal{H}_K, +_{\mathcal{H}_K}, \equiv_{\mathcal{H}_K})$ indexed
by a key (seed) $K \in \mathcal{K}$. For $\mathcal{H}_K$ in $F$, we denote by $m_K$ the logarithm of the cardinality
of the set into which $\mathcal{H}_K$ maps multisets of $B$, that is, $m_K$ is the number of output bits of
$\mathcal{H}_K$. We define $\mathcal{K}_m$ as the set of keys $K \in \mathcal{K}$ for which $m_K \geq m$. By $A(\mathcal{H}_K)$ we denote a
probabilistic polynomial time (in $m_K$) algorithm with oracle access to $(\mathcal{H}_K, +_{\mathcal{H}_K}, \equiv_{\mathcal{H}_K})$.

The family $F$ satisfies (multi)set-collision resistance if for all ppt algorithms $A(\cdot)$, any
number $c$, and $m$ large enough (with respect to $c$),$^1$

$$\mathrm{Prob}\left\{ \begin{array}{l} K \leftarrow \mathcal{K}_m,\ (S,M) \leftarrow A(\mathcal{H}_K) : \\ S \text{ is a (multi)set and } M \text{ is a multiset of } B \\ \text{such that } S \neq M \text{ and } \mathcal{H}_K(S) \equiv_{\mathcal{H}_K} \mathcal{H}_K(M) \end{array} \right\} < m^{-c}.$$

Note that because $A(\mathcal{H}_K)$ is polynomial in $m_K$, we will consider that it can only output
polynomial sized $S$ and $M$. We are disallowing compact representations for multisets that
would allow $A(\cdot)$ to express larger multisets (such compact representations do not lead to
a feasible attack in our offline memory integrity application).


4 Additive Multiset Hash 


In this section we give an example of a construction of (multi)set-collision resistant multiset 
hash functions. Let $B = \{0,1\}^m$ represent the set of bit vectors of length $m$ and let $M$ be
a multiset of elements of $B$. Recall that the number of times $b \in B$ is in the multiset $M$
is denoted by $M_b$ and is called the multiplicity of $b$ in $M$. Let $H_K : \{0,1\}^{m+1} \rightarrow \mathbb{Z}_n^l$ be
randomly selected from a pseudorandom family of hash functions [GGM86]. Let
$L \approx n^l \approx 2^m$ and define


$$\mathcal{H}_K(M) = \left[ H_K(0,r) + \sum_{b \in B} M_b H_K(1,b) \bmod n \ ;\ \sum_{b \in B} M_b \bmod L \ ;\ r \right], \qquad r \leftarrow B,$$


where $r \in B$ is a random nonce$^2$. Notice that the logarithm of the cardinality $m_K$ of the
set into which $\mathcal{H}_K$ maps multisets of $B$ is equal to


$$m_K = \log(n^l) + \log(L) + \log(2^m) \approx 3m.$$


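We say two triples $[h;c;r]$ and $[h';c';r']$ are equivalent, $[h;c;r] \equiv_{\mathcal{H}_K} [h';c';r']$, if and
only if $h - H_K(0,r) = h' - H_K(0,r')$ modulo $n$ and $c = c'$ modulo $L$. Notice that checking
whether $\mathcal{H}_K(M) \equiv_{\mathcal{H}_K} \mathcal{H}_K(M')$ is efficient. We define addition of two triples
$[h;c;r] +_{\mathcal{H}_K} [h';c';r']$ by the result of the computation


$$\left[ H_K(0,r'') + h - H_K(0,r) + h' - H_K(0,r') \bmod n \ ;\ c + c' \bmod L \ ;\ r'' \right], \qquad r'' \leftarrow B.$$

Clearly, $\mathcal{H}_K(M \cup M') \equiv_{\mathcal{H}_K} \mathcal{H}_K(M) +_{\mathcal{H}_K} \mathcal{H}_K(M')$, hence, $(\mathcal{H}_K, +_{\mathcal{H}_K}, \equiv_{\mathcal{H}_K})$ is a multiset
hash. The proof of the next theorem is in Appendix A.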


1 The probability is taken over a random selection of $K$ in $\mathcal{K}_m$ (denoted by $K \leftarrow \mathcal{K}_m$) and over the
randomness used in the ppt algorithm $A(\mathcal{H}_K)$ (denoted by $(S,M) \leftarrow A(\mathcal{H}_K)$).
2 Note, the set from which $r$ is taken could be smaller than $B$.


Theorem 1. It is computationally infeasible to find a multiset $M$ with multiplicities $< n$
and a multiset $M'$ such that the cardinalities of $M$ and $M'$ are polynomial sized in $m$,
$M \neq M'$, and $\mathcal{H}_K(M) \equiv_{\mathcal{H}_K} \mathcal{H}_K(M')$.


As an example we consider $n = 2$ and $l = m$. Then the condition that a multiset
has multiplicities $< 2$ simply means that $M$ is a set. This leads to set-collision resistance.
Furthermore notice that addition modulo 2 defines xor $\oplus$.


Corollary 1. (MSet-XOR-Hash) The multiset hash corresponding to


$$\mathcal{H}_K(M) = \left[ H_K(0,r) \oplus \bigoplus_{b \in B} M_b H_K(1,b) \ ;\ \sum_{b \in B} M_b \bmod 2^m \ ;\ r \right], \qquad r \leftarrow B,$$


where $H_K : \{0,1\} \times B \rightarrow \mathbb{Z}_2^m$ is randomly selected from a pseudorandom family of hash
functions, is set-collision resistant.
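
The construction in Corollary 1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: HMAC-SHA256 stands in for the pseudorandom function $H_K$ (so $m = 256$), elements of $B$ are represented as byte strings, and the names `prf`, `xor_hash`, `xor_equiv`, and `xor_add` are hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import Counter

M_BITS = 256            # m: width of the XOR component and of the nonce (assumption)
M_BYTES = M_BITS // 8

def prf(key: bytes, tag: int, x: bytes) -> int:
    """Stand-in for H_K(tag, x); HMAC-SHA256 is an assumption, not from the paper."""
    digest = hmac.new(key, bytes([tag]) + x, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")

def xor_hash(key: bytes, multiset: Counter, r: bytes = None):
    """Return the triple [h ; c ; r] for the given multiset."""
    r = secrets.token_bytes(M_BYTES) if r is None else r
    h = prf(key, 0, r)
    for b, mult in multiset.items():
        if mult % 2:                     # M_b * H_K(1,b) over Z_2: only parity matters
            h ^= prf(key, 1, b)
    c = sum(multiset.values()) % (1 << M_BITS)   # cardinality mod 2^m
    return (h, c, r)

def xor_equiv(key: bytes, t1, t2) -> bool:
    """Equivalence of two triples: strip the nonce masks, then compare."""
    (h1, c1, r1), (h2, c2, r2) = t1, t2
    return (h1 ^ prf(key, 0, r1)) == (h2 ^ prf(key, 0, r2)) and c1 == c2

def xor_add(key: bytes, t1, t2):
    """Combine two triples under a fresh nonce."""
    (h1, c1, r1), (h2, c2, r2) = t1, t2
    r = secrets.token_bytes(M_BYTES)
    h = prf(key, 0, r) ^ h1 ^ prf(key, 0, r1) ^ h2 ^ prf(key, 0, r2)
    return (h, (c1 + c2) % (1 << M_BITS), r)
```

In this sketch the XOR components of the empty multiset and of $\{x, x\}$ coincide after unmasking, so only the cardinality component tells them apart.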


Notice that $\mathcal{H}_K(M)$ keeps track of the cardinality of $M$. If this were not the case then
$\mathcal{H}_K(S)$ and $\mathcal{H}_K(M)$ would be equivalent for any $S$ and $M$ with $S_b = M_b$ modulo $n = 2$ for
$b \in B$. This would contradict set-collision resistance. Also notice that $r \leftarrow B$ is randomly
chosen. If $r$ were a fixed known constant, then knowledge of $n$ tuples $[M^i ; \mathcal{H}_K(M^i)]$ (in this
paragraph $n$ counts tuples and is unrelated to the modulus) reveals $n$ vectors

$$\bigoplus_{b \in B} M^i_b H_K(1,b) \in \mathbb{Z}_2^m.$$

If $n = 2m$ then with high probability these $n$ vectors span the vector space $\mathbb{Z}_2^m$. This
means that each vector in $\mathbb{Z}_2^m$ can be constructed as a linear combination of these $n$
vectors [BM97]:


$$\bigoplus_{i=1}^{n} a_i \left( \bigoplus_{b \in B} M^i_b H_K(1,b) \right) = \bigoplus_{b \in B} \left( \bigoplus_{i=1}^{n} a_i M^i_b \right) H_K(1,b).$$


Hence, a polynomial sized collision can be constructed for any polynomial sized $M$.

In Appendix B we show that for $n$ exponentially large in $m$, we may remove the
cardinality $\sum_{b \in B} M_b$ from the scheme altogether. By taking $l = 1$ and $L = n = 2^m$ we
obtain the next corollary.


Corollary 2. (MSet-Add-Hash) The multiset hash corresponding to


$$\mathcal{H}_K(M) = \left[ H_K(0,r) + \sum_{b \in B} M_b H_K(1,b) \bmod 2^m \ ;\ r \right], \qquad r \leftarrow B,$$


where $H_K : \{0,1\} \times B \rightarrow \mathbb{Z}_{2^m}$ is randomly selected from a pseudorandom family of hash
functions, is multiset collision resistant.
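
A corresponding sketch of Corollary 2 replaces XOR by addition modulo $2^m$ and drops the cardinality component. As before, HMAC-SHA256 is an assumed stand-in for $H_K$ (so $m = 256$), elements are byte strings, and the function names are hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import Counter

MOD = 1 << 256   # 2^m with m = 256 (assumption matching the SHA-256 output size)

def prf(key: bytes, tag: int, x: bytes) -> int:
    """Stand-in for H_K(tag, x) with values in Z_{2^m}; an assumption."""
    digest = hmac.new(key, bytes([tag]) + x, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")

def add_hash(key: bytes, multiset: Counter, r: bytes = None):
    """Return the pair [h ; r]: masked weighted sum of the element hashes."""
    r = secrets.token_bytes(32) if r is None else r
    h = prf(key, 0, r)
    for b, mult in multiset.items():
        h = (h + mult * prf(key, 1, b)) % MOD   # multiplicities count fully (carry)
    return (h, r)

def unmask(key: bytes, t) -> int:
    """Remove the nonce mask H_K(0, r) from a pair."""
    h, r = t
    return (h - prf(key, 0, r)) % MOD

def add_equiv(key: bytes, t1, t2) -> bool:
    return unmask(key, t1) == unmask(key, t2)

def add_add(key: bytes, t1, t2):
    """Combine two pairs under a fresh nonce."""
    r = secrets.token_bytes(32)
    return ((prf(key, 0, r) + unmask(key, t1) + unmask(key, t2)) % MOD, r)
```

Unlike the XOR variant, an element added twice contributes twice its hash, so $\{x, x\}$ and the empty multiset are distinguished without a separate cardinality field.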


The main difference between the MSet-XOR-Hash and MSet-Add-Hash is binary addi- 
tion without and with carry respectively. This leads to either set collision resistance or 
multiset collision resistance. 


In Appendix B we show that it is possible to replace the random nonce $r$ by a counter
that gets incremented on each use of $\mathcal{H}_K$. This removes the need for a random number
generator from the scheme. Moreover, shorter values can be used for $r$ as long as the
key is changed when $r$ overflows; this reduces the size of the hash. Also if the weighted
sum of the hashes $H_K(1,b)$ in $\mathcal{H}_K(M)$ is never revealed to the adversary then we can
remove $H_K(0,r)$ from the scheme altogether. This is the case, for example, when the weighted
sums are encrypted using a pseudorandom family of permutations (see Corollary 4 in
Appendix B).


5 Multiplicative Multiset Hash 


A multiset-collision resistant multiplicative multiset hash can be defined as follows. Let $q$
be a large prime power and consider the computations in the field $GF(q)$. Let $H : B \rightarrow GF(q)$
be a poly-random function [GGM86], that is, no polynomial time (in the logarithm
of $q$) algorithm with oracle access to $H$ can distinguish between values of $H$ and true random
strings, even when the algorithm is permitted to select the arguments to $H$ (in practice
one would use MD5 [Riv92] or SHA1 [NIS95]). We define


$$\mathcal{H}(M) = \prod_{b \in B} H(b)^{M_b}, \qquad (1)$$


$\equiv_{\mathcal{H}}$ to be equal to $=$, and $+_{\mathcal{H}}$ to be multiplication in $GF(q)$.

Clearly, $(\mathcal{H}, +_{\mathcal{H}}, \equiv_{\mathcal{H}})$ is a multiset hash. An advantage of the scheme is that we do
not need a secret key. Unfortunately it relies on finite field arithmetic, which makes it too 
costly for some applications. 

The proof of the following theorem is given in Appendix C, where we also define
the discrete log (DL) assumption which says that for random $y \in GF(q)$ and generator
$g \in GF(q)$, it is computationally infeasible to find $x$ such that $g^x = y$ ($x$ is called the
discrete log of $y$).


Theorem 2. (MSet-Mu-Hash) Under the DL assumption, $(\mathcal{H}, +_{\mathcal{H}}, \equiv_{\mathcal{H}})$ as defined in (1)
is multiset collision resistant.
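
Equation (1) can be sketched in a few lines of Python. The parameters are assumptions made only for illustration: the field is $GF(p)$ for the Mersenne prime $p = 2^{521} - 1$, and SHAKE-256 stands in for the poly-random function $H$, mapped into the nonzero field elements. These choices are not taken from the paper and are not a parameter recommendation.

```python
import hashlib
from collections import Counter

P = 2**521 - 1   # a Mersenne prime; illustrative field size only (assumption)

def H(b: bytes) -> int:
    """Stand-in for the poly-random H : B -> GF(P), restricted to nonzero values."""
    wide = int.from_bytes(hashlib.shake_256(b).digest(80), "big")
    return wide % (P - 1) + 1

def mu_hash(multiset: Counter) -> int:
    """Product over all elements b of H(b) raised to the multiplicity of b."""
    acc = 1
    for b, mult in multiset.items():
        acc = acc * pow(H(b), mult, P) % P
    return acc

def mu_add(h1: int, h2: int) -> int:
    """The + operator of the multiset hash is multiplication in the field."""
    return h1 * h2 % P
```

No key or nonce appears: hashes are compared with plain equality, and the hash of the empty multiset is the multiplicative identity 1.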


Under certain assumptions we may replace multiplication in $GF(q)$ by addition modulo
a large number. Even though the number of output bits of the resulting multiset hash needs
to be much larger (since it is based on 'weaker' assumptions), the overall solution becomes
more efficient since no finite field arithmetic is needed. Let $H : B \rightarrow \mathbb{Z}_n^l$, $n = 2^{\sqrt{m}}$,
$l = \sqrt{m}$, be a poly-random function. Now, we define

$$\mathcal{H}(M) = \sum_{b \in B} M_b H(b) \bmod n, \qquad (2)$$


$\equiv_{\mathcal{H}}$ to be equal to $=$, and $+_{\mathcal{H}}$ to be vector addition modulo $n$. See Appendix D for the proof
of the next theorem and the definition of the worst-case shortest vector (SV) problem.


Theorem 3. (MSet-VAdd-Hash) By assuming that the SV problem is infeasible to solve
in polynomial time, $(\mathcal{H}, +_{\mathcal{H}}, \equiv_{\mathcal{H}})$ as defined in (2) is multiset collision resistant.
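
Equation (2) admits a similar sketch. The toy parameters $\sqrt{m} = 16$ (so $n = 2^{16}$ and $l = 16$) and the use of SHAKE-256 as the poly-random function $H$ are assumptions for illustration; secure parameters would have to be far larger.

```python
import hashlib
from collections import Counter

SQRT_M = 16          # toy value of sqrt(m); an assumption for illustration
N = 1 << SQRT_M      # n = 2^sqrt(m)
L_DIM = SQRT_M       # l = sqrt(m): number of vector entries

def H(b: bytes) -> list:
    """Stand-in for H : B -> Z_n^l; each 16-bit chunk is one entry in Z_n."""
    raw = hashlib.shake_256(b).digest(2 * L_DIM)
    return [int.from_bytes(raw[2 * i:2 * i + 2], "big") for i in range(L_DIM)]

def vadd_hash(multiset: Counter) -> list:
    """Sum over all elements b of (multiplicity of b) * H(b), entrywise mod n."""
    acc = [0] * L_DIM
    for b, mult in multiset.items():
        acc = [(a + mult * x) % N for a, x in zip(acc, H(b))]
    return acc

def vadd_add(h1: list, h2: list) -> list:
    """The + operator of the multiset hash is vector addition mod n."""
    return [(a + b) % N for a, b in zip(h1, h2)]
```

With such a small toy modulus, a multiset containing one element $n$ times hashes to the zero vector, like the empty multiset. This is why collision resistance is stated only for multisets of polynomial size, while $n = 2^{\sqrt{m}}$ grows faster than any polynomial in $m$.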


Remark. Because $\mathcal{H}$ can be evaluated with oracle access to $H$, Theorems 2 and 3 still hold
for a stronger form of multiset-collision resistance, in which it is computationally infeasible
for an adversary with oracle access to $H$ (instead of $\mathcal{H}$) to find a collision. This is what
allows us to use a publicly available $H$.


6 Integrity Checking of Random Access Memory


We now show how our multiset hash functions can be used to build secure offline integrity
checkers for memory. Section 6.1 explains the model, and Section 6.2 shows how our offline
checker improves the offline checker in [BEG+91].


6.1 Model 


[Figure: the FSM issues store and load operations to the checker, which holds the trusted
state and issues write and read operations to the untrusted RAM.]

Fig. 1. Model


Figure 1 illustrates the model we use. There is a checker that keeps and maintains 
some small, fixed-sized, trusted state. The untrusted RAM (main memory) is arbitrarily 
large. The finite state machine (FSM) generates loads and stores and the checker updates 
its trusted state on each FSM load or store to the untrusted RAM. The checker uses its 
trusted state to verify the integrity of the untrusted RAM. The trusted computing base 
(TCB) consists of the FSM, and the checker with its trusted state. For example, the FSM 
could be a processor. The checker would be special hardware that is added to the processor 
to detect tampering in the external memory. 

The checker checks if the untrusted RAM behaves correctly, i.e. like valid RAM. RAM 
behaves like valid RAM if the data value that the checker reads from a particular address 
is the same data value that the checker most recently wrote to that address. In our model, 
the untrusted RAM is assumed to be actively controlled by an adversary. The untrusted 
RAM may not behave like valid RAM if the RAM has malfunctioned because of errors, 
or if it has been somehow altered by the adversary. 

For this problem, a simple solution such as calculating a message authentication code 
(MAC) of the data value and address, writing the (data value, MAC) pair to the address, 
and using the MAC to check the data value on each read, does not work [LTM+00]. The
approach does not prevent replay attacks: an adversary can replace the (data value, MAC) 
pair currently at an address with a different pair which was previously written to the 
address. The essence of an offline checker is that a “log” of the sequence of FSM operations 
is maintained in fixed-sized trusted state in the checker. 
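
The replay attack on the simple MAC scheme can be made concrete with a short sketch. The key, the dictionary `ram` that models untrusted memory, and the helper names are hypothetical; HMAC-SHA256 is used as the MAC by assumption.

```python
import hashlib
import hmac

KEY = b"checker-secret-key"   # hypothetical secret held by the checker
ram = {}                      # untrusted memory: address -> (value, MAC)

def mac(addr: int, value: bytes) -> bytes:
    """MAC over the address and the data value."""
    return hmac.new(KEY, addr.to_bytes(8, "big") + value, hashlib.sha256).digest()

def mac_write(addr: int, value: bytes) -> None:
    ram[addr] = (value, mac(addr, value))

def mac_read(addr: int) -> bytes:
    value, tag = ram[addr]
    if not hmac.compare_digest(tag, mac(addr, value)):
        raise ValueError("tampering detected")
    return value

mac_write(7, b"old")
stale = ram[7]           # the adversary records the (data value, MAC) pair
mac_write(7, b"new")
ram[7] = stale           # ... and later replays it at the same address
replayed = mac_read(7)   # the MAC verifies, yet the value is stale
```

The per-address check accepts the stale pair because the MAC carries no information about which write was the most recent one.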


6.2 Improved Offline Checker


Figure 2 shows the basic put and get operations. The checker maintains two multiset 
hashes and a counter. In memory, each data value is accompanied by a time stamp. Each 
time the checker performs a put operation, it appends the current value of the counter (a 
time stamp) to the data value, and writes the (data value, time stamp) pair to memory. 
When the checker performs a get operation, it reads the pair stored at an address, and, 
if necessary, updates the counter so that it is strictly greater than the time stamp that 
was read. The multiset hashes are updated ($+_{\mathcal{H}}$) with $(a, v, t)$ triples corresponding to the
pairs written or read from memory. 

Figure 3 shows how the checker implements the store-load interface. To initialize the 
RAM, the checker puts an initial value to each address. When the FSM performs a store 
operation, the checker gets the original value at the address, then puts the new value 
to the address. When the FSM performs a load operation, the checker gets the original 
value at the address and returns this value to the FSM; it then puts the same value back 
to the address. To check the integrity of the RAM at the end of the FSM’s operation, 
the checker gets the value at each address, then compares WRITEHASH and READHASH. 
If WRITEHASH is equal to READHASH, the checker concludes that the RAM has been 
behaving correctly. Intermediate checks can also be performed with a slightly modified 
check operation [CGS+02]. First, the checker creates a new TIMER and WRITEHASH. Then, as
memory is read to perform the check, put operations are performed to update the time 
stamps in RAM and the new WRITEHASH. 


The checker’s fixed-sized state is: 


— 2 multiset hashes: WRITEHASH and READHASH. 
Initially both multiset hashes are 0. 
— 1 counter: TIMER. Initially TIMER is 0. 


put(a,v) writes a value v to address a in memory: 


1. Let t be the current value of TIMER. Write (v,t) to a in memory.
2. Update WRITEHASH: WRITEHASH $+_{\mathcal{H}}$= hash(a, v, t).


get(a) reads the value at address a in memory:

1. Read (v,t) from a in memory.
2. Update READHASH: READHASH $+_{\mathcal{H}}$= hash(a, v, t).
3. TIMER = max(TIMER, t + 1).


Fig. 2. put and get operations 


Because the checker checks that WRITEHASH is equal to READHASH, substitution
(the RAM returns a value that is never written to it) and replay (the RAM returns a
stale value instead of the one that is most recently written) attacks on the RAM are
prevented. The purpose of the time stamps is to prevent reordering attacks in which RAM
returns a value that has not been written yet so that it can subsequently return stale
data. Suppose we consider the put and get operations that occur on a particular address
as occurring on a timeline. Line 3 in the get operation ensures that, for each store and
load operation, each write has a time stamp that is strictly greater than all of the time
stamps previously read from memory. Therefore, the first time an adversary tampers with
a particular (data value, time stamp) pair that is read from memory, there will not be
an entry in the WRITEHASH matching the adversary's entry in the READHASH, and that
entry will not be added to the WRITEHASH at a later time.


initialize() initializes RAM.

1. put(a,0) for each address a.


store(a,v) stores v at address a.

1. get(a).
2. put(a,v).


load(a) loads the data value at address a.

1. v = get(a). Return v to the caller.
2. put(a,v).


check() checks if the RAM has behaved correctly (at the end of operation).

1. get(a) for each address a.
2. If WRITEHASH is equal to READHASH, return true.


Fig. 3. Offline integrity checking of random access memory

The TIMER is not solely under the control of the checker, and is a function of what is 
read from memory, which is untrusted. Therefore, the WRITEHASH cannot be guaranteed 
to be over a set. For example, for a sequence of store and load operations occurring on 
the same address, an adversary can decrease the time stamp that is stored in memory and 
have triples be added to the WRITEHASH multiple times. The READHASH can also not 
be guaranteed to be over a set because the adversary controls the pairs that are read from 
memory. Thus, set-collision resistance is not sufficient, and we require multiset-collision 
resistant hash functions. 
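
The put, get, store, load, and check operations can be sketched as a small Python class. This is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the multiset hash is a keyed sum modulo $2^{256}$ in the spirit of MSet-Add-Hash with the nonce term omitted (on the assumption that the trusted hashes are never revealed to the adversary), HMAC-SHA256 stands in for the pseudorandom function, and the class `OfflineChecker` and the driver `run` are hypothetical names.

```python
import hashlib
import hmac

MOD = 1 << 256

class OfflineChecker:
    def __init__(self, key: bytes, ram: dict, addresses):
        self.key, self.ram, self.addresses = key, ram, list(addresses)
        self.write_hash = 0   # WRITEHASH
        self.read_hash = 0    # READHASH
        self.timer = 0        # TIMER
        for a in self.addresses:          # initialize(): put(a, 0) for each address
            self._put(a, 0)

    def _h(self, a, v, t) -> int:
        """Hash of one (address, value, time stamp) triple."""
        msg = repr((a, v, t)).encode()
        return int.from_bytes(hmac.new(self.key, msg, hashlib.sha256).digest(), "big")

    def _put(self, a, v):
        t = self.timer
        self.ram[a] = (v, t)
        self.write_hash = (self.write_hash + self._h(a, v, t)) % MOD

    def _get(self, a):
        v, t = self.ram[a]                # untrusted read
        self.read_hash = (self.read_hash + self._h(a, v, t)) % MOD
        self.timer = max(self.timer, t + 1)
        return v

    def store(self, a, v):
        self._get(a)
        self._put(a, v)

    def load(self, a):
        v = self._get(a)
        self._put(a, v)
        return v

    def check(self) -> bool:
        for a in self.addresses:
            self._get(a)
        return self.write_hash == self.read_hash

def run(tamper: bool) -> bool:
    """Drive a short access sequence, optionally with a replay attack."""
    ram = {}
    checker = OfflineChecker(b"k" * 32, ram, range(4))
    checker.store(1, 10)
    stale = ram[1]            # adversary records the pair (10, time stamp)
    checker.store(1, 20)
    if tamper:
        ram[1] = stale        # adversary replays the stale pair
    checker.load(1)
    return checker.check()
```

In the tampered run the stale triple is read twice but written only once, so the multiset of read triples differs from the multiset of written triples and the final comparison fails.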

The proof of the following theorem is in Appendix E. 


Theorem 4. Let W be the multiset of triples written to memory and let R be the mul- 
tiset of triples read from memory. That is, W hashes to WRITEHASH and R hashes to 
READHASH. Suppose the accesses to each address are an alternation of puts and gets. 
If the RAM does not behave like valid RAM, then $W \neq R$.


The following corollary shows the hardness of breaking our offline memory integrity 
checking scheme. 


Corollary 3. Tampering with the RAM without being detected is as hard as finding a 
collision $W \neq R$ for the multiset hash function.


The original offline checker in [BEG+91] differs from our improved checker in two
respects. First, the original checker is implemented with ε-biased hash functions [NN90].
These hash functions are set-collision resistant against random errors but not against a 
malicious adversary. Secondly, the TIMER is incremented on each put operation and is 
not a function of what is read from memory. The TIMER is solely under the control of 
the checker. This means that the pairs that are used to update WRITEHASH form a set. 
Therefore set-collision resistance is sufficient. The original offline checker can be made 
secure by using a set-collision resistant multiset hash function. 

It is worth mentioning how the two offline checkers can be optimized when there is a 
cache in the processor. Both checkers work similarly. The cache contains only data values, 
which are trusted because they are on-chip. In essence, when the cache brings in a data 
value, the checker gets the data value from the address and caches it. When the cache 
evicts a data value, the checker puts the data value to the appropriate address. The cache 
hit rate (the fraction of FSM operations that find their data in the cache) is the same
for both schemes, and it is actually the same as for the base processor which does not
provide memory integrity checking mechanisms. In both schemes, the principal
performance overhead when sequences of memory operations are checked comes from reading
and writing time stamps [CGS+02]. The improved offline checker improves on the original
checker because TIMER is not incremented on every load and store operation. Thus,
time stamps can be smaller without increasing the frequency of checks, which improves 
the performance of the checker. 


7 Conclusion 


We have introduced incremental multiset hash functions which can be efficiently updated, 
and for which the ordering of inputs is not important. Table 1 summarizes our comparison 
of the multiset hash functions introduced in this paper. In the table, we indicate whether 
the security is based on a pseudorandom family of hash functions (PRF), the random oracle
model (RO), the discrete log assumption (DL), and/or the hardness of the worst case
shortest vector problem (SV). If hashes are to be visible to the adversary (i.e., the adversary
can see the hashes in the trusted state, but cannot modify them), we indicate whether a
random nonce/counter ($r$), or encryption is necessary. We have improved the security and
the performance of the offline memory integrity checker in [BEG+91] as one application of
these functions. We are currently investigating how multiset hash functions can be applied 
in other areas. 


A Proof of Collision Resistance of Additive Multiset Hash 


Let $G_m$ be the family of matrices with $2^{m+1}$ rows, $l$ columns, and entries in $\mathbb{Z}_n$ (recall
$L \approx n^l \approx 2^m$). Let $H_K$ be a random matrix in $G_m = \{H_1, H_2, H_3, \ldots\}$. Notice that $H_K$ is
the $K$-th matrix in $G_m$. We assume that this matrix, or equivalently its label $K$, is secret
and only accessible by the secure processor. The family of matrices $G_m$ from which $H_K$ is
selected is publicly known.

The rows of $H_K$ are labeled by $x \in \{0,1\}^{m+1}$ and denoted by $H_K(x)$. This represents
$H_K$ as a function from $x \in \{0,1\}^{m+1}$ to $\mathbb{Z}_n^l$, the set of vectors with length $l$ and entries
in $\mathbb{Z}_n$. In practice, $H_K$ is not a completely random matrix over $\mathbb{Z}_n$ but $H_K$ is selected
from a pseudorandom family of functions. We address this issue as soon as we are ready
to formulate a proof of Theorem 1.


               | collision  | key | security | comput.    | length of | offline  | needed for
               | resistance |     | based on | efficiency | output    | checker  | visible hash
MSet-XOR-Hash  | set        | Y   | PRF      | ++         | +         | original | r/encryption
MSet-Add-Hash  | multiset   | Y   | PRF      | ++         | +         | both     | r/encryption
MSet-Mu-Hash   | multiset   | N   | RO/DL    | -          | +         | both     |
MSet-VAdd-Hash | multiset   | N   | RO/SV    | +          | -         | both     |


Table 1. Comparison of the Multiset Hash Functions

The following theorem is about the probability that an adversary finds a collision for
some multiset $M'$. The probability is taken over random matrices $H_K$ in $G_m$ ($H_K \leftarrow G_m$)
and the randomness of the random nonce used in $\mathcal{H}_K$.


Theorem 5. Let $M$ and $M'$ be multisets of $B$. Let $d$ be the greatest common divisor$^3$ of $n$
and each of the differences $|M_b - M'_b|$, $b \in B$. Given knowledge of $u$ tuples $[M^i ; \mathcal{H}_K(M^i)]$,
the probability that $M$ is a collision for $M'$ is at most $u^2/2^m + (d/n)^l$.


We first introduce some notation. Let $v(r, M)$ be the vector of length $2^{m+1}$ defined by

$$v(r, M)_{(0,b)} = 1 \text{ if and only if } b = r$$

and

$$v(r, M)_{(1,b)} = M_b.$$

Let $v(M)$ be the vector of length $2^{m+1}$ defined by $v(M)_{(0,b)} = 0$ and $v(M)_{(1,b)} = M_b$.

Lemma 1. (i) Knowing $[M ; \mathcal{H}_K(M)]$ is equivalent to knowing
$[v(r, M) ; v(r, M)H_K \bmod n]$.

(ii) $\mathcal{H}_K(M) \equiv_{\mathcal{H}_K} \mathcal{H}_K(M')$ if and only if $v(M)H_K = v(M')H_K$ modulo $n$ and
$\sum_{b \in B} M_b = \sum_{b \in B} M'_b$ modulo $L$.


Proof. Notice that $v(r, M)$ encodes $r$, $M$, and, hence, the cardinality $\sum_{b \in B} M_b$ of $M$, and
notice that


$$\mathcal{H}_K(M) = \left[ v(r, M)H_K \bmod n \ ;\ \sum_{b \in B} M_b \bmod L \ ;\ r \right].$$


The lemma follows immediately from these observations.


3 The greatest common divisor of 0 with a positive integer i is equal to i. 
Suppose that an adversary learns $u$ tuples $[M^i ; \mathcal{H}_K(M^i)]$ or, according to Lemma
1.(i), $u$ vectors $v(r^i, M^i)$ together with the corresponding $v(r^i, M^i)H_K$ modulo $n$. Let $A$
be the $u \times 2^{m+1}$ matrix with rows $v(r^i, M^i)$. Then the matrix with rows $v(r^i, M^i)H_K$
is equal to $AH_K$. Clearly, $A$ modulo $n$ has full rank over $\mathbb{Z}_n$ if all $r^i$ are different. The
probability that there are two equal $r^i$'s is at most $u^2/2^m$.


Lemma 2. The probability that the $r^i$'s corresponding to matrix $A$ are all different is at
least $1 - u^2/2^m$.


By Lemma 1.(ii), in order to find a collision for $M'$, the adversary needs to find a
multiset $M \neq M'$ such that $v(M)H_K = v(M')H_K$ modulo $n$ and such that the cardinalities
of $M$ and $M'$ are equal to one another modulo $L$. The next three lemmas show how difficult
this is for the adversary if he is in the situation of the previous lemma.


Lemma 3. Let $M$ and $M'$ be multisets of $B$. The probability that $v(M)H_K = v(M')H_K$
modulo $n$ is statistically independent of the knowledge of a full rank matrix $A$ over $\mathbb{Z}_n$
corresponding to different $r^i$'s and the knowledge of $h = AH_K$ modulo $n$.


Proof. W.l.o.g. (after reordering the columns of $A$ and the corresponding entries of
$v(M) - v(M')$ and corresponding rows of $H_K$) matrix $A$ has the form $A = (I \ A^1)$, where $I$ is the
$u \times u$ identity matrix, and $v(M) - v(M')$ has the form $(0 \ v)$, where $0$ is the all zero vector
of length $u$. Denote the top $u$ rows of $H_K$ by $H_K^0$ and let $H_K^1$ be such that

$$H_K = \begin{pmatrix} H_K^0 \\ H_K^1 \end{pmatrix}.$$

Clearly, the equation $h = AH_K$ modulo $n$ is equivalent to

$$h = H_K^0 + A^1 H_K^1 \bmod n. \qquad (3)$$

The equation $0 = (v(M) - v(M'))H_K$ modulo $n$ is equivalent to

$$0 = v H_K^1 \bmod n. \qquad (4)$$


Straightforward counting tells us that $\mathrm{Prob}\{(4)|(3)\}$ is equal to the number of matrices $H_K^1$
satisfying (4) divided by the total number of matrices $H_K^1$. This is in turn equal to the number of
matrices $H_K$ satisfying (4) divided by the total number of matrices $H_K$, which is $\mathrm{Prob}\{(4)\}$.


Lemma 4. Let $M$ and $M'$ be multisets of $B$. Let $d$ be the greatest common divisor of
$n$ and each of the differences $|M_b - M'_b|$, $b \in B$. Then $(v(M) - v(M'))H_K$ modulo $n$ is
uniformly distributed in $d\mathbb{Z}_n^l$.


Proof. To prove this lemma, we show that each entry of $(v(M) - v(M'))H_K$ modulo $n$ is
uniformly distributed in $d\mathbb{Z}_n$. Let $y$ represent one of the columns of $H_K$ and define for
$\beta \in \mathbb{Z}_n$ the set

$$C_\beta = \{ y : (v(M) - v(M'))y = \beta \bmod n \}.$$

Since $d$ divides each entry of $v(M) - v(M')$, it also divides the product $(v(M) - v(M'))y$,
hence, $C_\beta = \emptyset$ if $\beta$ is not divisible by $d$. Since $d$ is the greatest common divisor of $n$ and
each of the entries of $v(M) - v(M')$, there exists a vector $y$ such that $(v(M) - v(M'))y = d$
modulo $n$. This proves that $C_\beta \neq \emptyset$ if and only if $d$ divides $\beta$. For a fixed column $y' \in C_\beta \neq \emptyset$,
the mapping $y \in C_\beta \rightarrow y - y' \in C_0$ is a bijection. Hence, the non-empty sets $C_\beta$ have equal
cardinality. We conclude that each entry of $(v(M) - v(M'))H_K$ modulo $n$ is uniformly
distributed in $d\mathbb{Z}_n$.
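
The statement of Lemma 4 for a single column can be checked by brute force on a small instance. The values $n = 12$ and the difference vector $(4, -6, 0)$ below are arbitrary choices made for this illustration, and the function name `output_distribution` is hypothetical.

```python
from collections import Counter
from functools import reduce
from itertools import product
from math import gcd

def output_distribution(diff, n):
    """Count (diff . y) mod n over every column y in Z_n^len(diff)."""
    counts = Counter()
    for y in product(range(n), repeat=len(diff)):
        counts[sum(a * b for a, b in zip(diff, y)) % n] += 1
    return counts

n = 12
diff = (4, -6, 0)                                 # plays the role of v(M) - v(M')
d = reduce(gcd, (abs(x) for x in diff), n)        # gcd of n and all |M_b - M'_b|
dist = output_distribution(diff, n)
```

In this instance `d` equals 2, the only values that occur are the multiples of 2 modulo 12, and each of them is hit by the same number of columns, in line with the bijection argument in the proof.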


Lemma 5. Let $M$ and $M'$ be multisets of $B$. Let $d$ be the greatest common divisor of $n$
and each of the differences $|M_b - M'_b|$, $b \in B$. Given knowledge of a full rank matrix $A$
over $\mathbb{Z}_n$ corresponding to different $r^i$'s and given knowledge of $h = AH_K$ modulo $n$, the
probability that $v(M)H_K = v(M')H_K$ modulo $n$ is equal to $(d/n)^l$.


Proof. By Lemma 3, since matrix $A$ corresponds to different $r^i$'s and
$(v(M) - v(M'))_{(0,r^i)} = 0$, the probability that the randomly chosen matrix $H_K$ satisfies
$0 = (v(M) - v(M'))H_K$ modulo $n$ is independent of the knowledge of $h = AH_K$ mod $n$. By
Lemma 4, since $H_K$ is uniformly distributed, $(v(M) - v(M'))H_K$ is uniformly distributed in
$d\mathbb{Z}_n^l$. Hence, the probability that $0 = (v(M) - v(M'))H_K$ mod $n$ is equal to one
divided by the cardinality of $d\mathbb{Z}_n^l$, which is equal to $(d/n)^l$.


Combining Lemmas 2 and 5 proves Theorem 5. To prove Theorem 1 we need the 
following extra lemma. 


Lemma 6. Suppose that $v(M) = v(M')$ modulo $n$, $\sum_{b \in B} M_b = \sum_{b \in B} M'_b$
modulo $L$, the cardinalities of $M$ and $M'$ are $< L$, and that the multiplicities of $M$
are $< n$. Then $M = M'$.


Proof. If the cardinalities of $M$ and $M'$ are equal modulo $L$ and $< L$ then

$$\sum_{b \in B} M_b = \sum_{b \in B} M'_b. \tag{5}$$

If all entries of $v(M)$ are $< n$ and $v(M) = v(M')$ modulo $n$, then

$$M'_b = M_b + \beta_b n, \quad b \in B, \tag{6}$$

for integers $\beta_b \ge 0$. Combining (5) and (6) proves $\sum_{b \in B} \beta_b = 0$, hence,
all $\beta_b = 0$. We conclude that $M = M'$.
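Lemma 6 can also be confirmed by brute force on a tiny universe. The sketch below is only an
illustration under assumed parameters (n = 3, L = 8, two elements in B); the function name and
the `use_cardinality` switch are hypothetical. It also shows why the cardinality condition
matters: without it, adding n copies of one element goes unnoticed modulo n.

```python
from itertools import product

def colliding_pairs(n, L, size, use_cardinality=True):
    """List pairs (M, M') with M != M' that the compared quantities cannot tell apart."""
    pairs = []
    for M in product(range(n), repeat=size):          # multiplicities of M are < n
        for Mp in product(range(L), repeat=size):
            if sum(M) >= L or sum(Mp) >= L:           # cardinalities must be < L
                continue
            same_vector = all((a - b) % n == 0 for a, b in zip(M, Mp))
            same_count = (sum(M) - sum(Mp)) % L == 0
            if M != Mp and same_vector and (same_count or not use_cardinality):
                pairs.append((M, Mp))
    return pairs

with_check = colliding_pairs(n=3, L=8, size=2)
without_check = colliding_pairs(n=3, L=8, size=2, use_cardinality=False)
print(len(with_check), len(without_check))
```

With the cardinality comparison no distinct pair survives, as the lemma states; dropping it
admits pairs such as M = (0, 0) and M' = (3, 0).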


Now we are ready to prove Theorem 1. 


Proof. Let $A(H_K)$ be a probabilistic polynomial time (in $m$, where $n^l \approx 2^m$)
algorithm with oracle access to $(\mathcal{H}_K, +_{\mathcal{H}_K}, \equiv_{\mathcal{H}_K})$.
Then $A(H_K)$ can gain knowledge about at most a polynomial number $u(m)$ of tuples
$[M^i ; \mathcal{H}_K(M^i)]$ (here $u(\cdot)$ denotes a polynomial). Furthermore, $A(H_K)$ can
search for a collision among at most a polynomial number $t(m)$ of pairs $(M, M')$, where $M$
and $M'$ are multisets, $M \neq M'$, and $M$ has multiplicities $< n$. According to Theorem 5,
the probability that $A(H_K)$ finds a collision is at most

$$t(m)\left(u(m)^2/2^m + (d/n)^l\right).$$

Since $A(H_K)$ can only compute polynomial sized multisets, the cardinalities of the
multisets $M$ and $M'$ are $< L = 2^m$. This allows us to apply Lemma 6 and conclude that
$0 \neq (v(M) - v(M'))$ modulo $n$. Hence, the greatest common divisor $d$ of $n$ and each of
the differences $|M_b - M'_b|$, $b \in B$, is a proper divisor of $n$ and therefore at most
$n/2$. This leads to

$$(d/n)^l \le 2^{-l}.$$

Let $c > 0$ be any number and suppose that $2^{-l} \ge m^{-c}$, or equivalently,
$l \le c \log m$. Notice that each of the differences $|M_b - M'_b|$ is polynomial sized in $m$,
hence, $d$ is polynomial sized in $m$ and there exists a number $e > 0$ such that $d \le m^e$
for $m$ large enough. This proves

$$(d/n)^l \le m^{el}/n^l \approx m^{el} (2^m)^{-1},$$

which is at most $m^{-c}$ for $m$ large enough. We conclude that the probability that $A(H_K)$
finds a collision is at most $m^{-c}$ for $m$ large enough. This proves Theorem 1 for random
matrices $H_K$.


Remark. The theorem also holds for a pseudorandom family of hash functions represented 
as matrices. Suppose that an adversary can compute a collision with a significant proba- 
bility of success in the case where a pseudorandom family of hash functions is used. We 
have just shown that an adversary has a negligible probability of success in the case where 
random hash functions are used. Hence, with a significant probability of success he is able
to distinguish between the use of pseudorandom hash functions and the use of random
hash functions. This contradicts the definition of pseudorandomness; see [BGR95] for a
detailed proof of a similar result.


B. Variants of Additive Multiset Hash 


A few interesting variants of $\mathcal{H}_K$ exist. Suppose that $v(M) = v(M')$ modulo $n$ and
that the multiplicities of $M$ and $M'$ are $< n$. Then clearly $M = M'$. Hence, we do not need
Lemma 6 in the proof of Theorem 5. This means that the proof of Theorem 5 does not
depend on the cardinalities of $M$ and $M'$ being equal modulo $L$. We can remove the
cardinality $\sum_{b \in B} M_b$ from the scheme altogether. For example, for $n$ exponentially
large, the cardinalities and in particular the multiplicities of $M$ and $M'$ are $< n$. This
proves Corollary 2. Another example is $n = 2$ with both $M$ and $M'$ sets, which proves the
main result of [BGR95].

Secondly, it is possible to replace the random nonce $r$ by a counter that gets incremented
on each use of $\mathcal{H}_K$, or by any other value that never repeats itself in polynomial
time. This guarantees with probability 1 that the matrix $A$ corresponds to different $r^i$'s
(see Lemma 2). This removes the need for a random number generator from the scheme.
Moreover, shorter values can be used for $r$ as long as the key is changed when $r$ overflows;
this reduces the size of the hash.
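The counter variant described above can be sketched as follows. This is a minimal
illustration, not the scheme as specified in the paper: HMAC-SHA256 stands in for the
pseudorandom family $H_K$, the width m = 128 is an arbitrary choice, the cardinality field is
omitted (as allowed for exponentially large n), and the class and method names are
hypothetical.

```python
import hashlib
import hmac
from collections import Counter

M_BITS = 128                 # hypothetical hash width m, so n = 2**m
N = 1 << M_BITS

class CounterMSetAddHash:
    """Toy MSet-Add-Hash variant whose nonce r is a counter (illustrative only)."""

    def __init__(self, key: bytes):
        self.key = key
        self.counter = 0     # replaces the random nonce r; never repeats under one key

    def _prf(self, tag: int, data: bytes) -> int:
        # HMAC-SHA256 is an assumed stand-in for H_K(tag, data).
        digest = hmac.new(self.key, bytes([tag]) + data, hashlib.sha256).digest()
        return int.from_bytes(digest, "big") % N

    def _mask(self, r: int) -> int:
        return self._prf(0, r.to_bytes(16, "big"))      # H_K(0, r)

    def _fresh_nonce(self) -> int:
        self.counter += 1
        return self.counter

    def hash(self, multiset: Counter):
        r = self._fresh_nonce()
        total = self._mask(r)
        for element, mult in multiset.items():
            total = (total + mult * self._prf(1, element)) % N
        return total, r

    def add(self, h, element: bytes):
        # Incremental update: strip the old mask, add one element, re-mask.
        total, r = h
        r_new = self._fresh_nonce()
        total = (total - self._mask(r) + self._prf(1, element) + self._mask(r_new)) % N
        return total, r_new

    def equivalent(self, h1, h2) -> bool:
        # Compare the unmasked sums; the nonces themselves may differ.
        return (h1[0] - self._mask(h1[1])) % N == (h2[0] - self._mask(h2[1])) % N
```

Because the counter never repeats under a given key, every hash is masked with a distinct
$H_K(0, r)$, which is the property the random nonce was providing.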

If $u = 0$ then the proof of Theorem 5 does not depend on matrix $A$ and its corresponding
$r^i$'s. Similarly, if sums of hashes,

$$H_K(0,r) + \sum_{b \in B} M_b H_K(1,b) \bmod n,$$

are hidden from the adversary (he knows which multiset $M$ is being hashed, but not the
value of the sum of hashes) then we can remove $H_K(0,r)$ from the scheme altogether. As
the following corollary shows, complete hiding is not necessary. We can use a pseudorandom
permutation to hide sums of hashes.


Corollary 4. (Permuted-MSet-XOR-Hash) The multiset hash corresponding to

$$\mathcal{H}_{K,K'}(M) = \left( P_{K'}\Big( \bigoplus_{b \in B} M_b H_K(1,b) \Big) ; \sum_{b \in B} M_b \bmod 2^m \right),$$

where $H_K : \{0,1\} \times B \to \mathbb{Z}_2^m$ and $P_{K'}$ are randomly selected from a
pseudorandom family of hash functions and permutations, is set-collision resistant.
(Permuted-MSet-Add-Hash) The multiset hash corresponding to

$$\mathcal{H}_{K,K'}(M) = P_{K'}\Big( \sum_{b \in B} M_b H_K(1,b) \bmod 2^m \Big),$$

where $H_K : \{0,1\} \times B \to \mathbb{Z}_{2^m}$ and $P_{K'}$ are randomly selected from a
pseudorandom family of hash functions and permutations, is multiset-collision resistant.


Notice that the multiset hashes are incremental because $P_{K'}$ is a permutation and,
hence, invertible.


Proof. We first consider a random function $P_{K'}$. Suppose that the adversary learns $u$
tuples $[M^i ; \mathcal{H}_{K,K'}(M^i)]$. As in Lemma 2, the probability that two permuted sums
of hashes in the $u$ tuples are equal is at most $u^2/2^m$. If all of them are unequal to one
another then matrix $AH_K$ (defined without the part corresponding to the random nonce)
is uniformly distributed and not known to the adversary (since $P_{K'}$ is a random function).
Hence, the probability that $v(M)H_K = v(M')H_K$ modulo $n$ is statistically independent of
the knowledge of the adversary. This can be used instead of Lemma 5 to prove Theorems
5 and 1. This result also holds for a pseudorandom family of permutations $P_{K'}$; see the
remark at the end of the proof of Theorem 1 in Appendix A.


C. Proof of Collision Resistance of Multiplicative Multiset Hash


In the following lemma $A(\cdot)$ is a probabilistic polynomial time (in $\log q$) algorithm
which outputs weights⁴ $w_1, \ldots, w_u \in \mathbb{Z}_{q-1}$ for a polynomial number of random
inputs $x_1, \ldots, x_u \in GF(q)$ such that $1 = \prod_i x_i^{w_i}$ with probability at least
$p$. We show that if such an algorithm exists then we can break the DL problem in $GF(q)$ in
polynomial time with probability at least $p$ divided by a polynomial in $\log q$.


Lemma 7. Let $A(\cdot)$ be a ppt algorithm such that there exists a number $c$ such that for
$u \le (\log q)^c$,

$$\mathrm{Prob}\left\{ (x_i \leftarrow GF(q))_{i=1}^{u},\ (w_i \in \mathbb{Z}_{q-1})_{i=1}^{u} \leftarrow A(x_1, \ldots, x_u) :\ 1 = \prod_i x_i^{w_i},\ \exists_i\, w_i \neq 0,\ \forall_i\, |w_i| \le (\log q)^c \right\} \ge p. \tag{7}$$

Let $g$ be a generator of $GF(q)$. Then there exists a probabilistic polynomial time (in
$\log q$) algorithm $A'(\cdot)$ such that

$$\mathrm{Prob}\{ y \leftarrow GF(q),\ x \leftarrow A'(y) :\ y = g^x \} \ge p/(\log q)^c.$$

In words, given a random $y \in GF(q)$, we are able to find the discrete log of $y$ in $GF(q)$
with probability at least $p/(\log q)^c$.

4 Not all equal to zero and each of them bounded by a polynomial number.


Proof. Let $y \leftarrow GF(q)$. Select a polynomial number $u$ of random elements
$r_1, \ldots, r_u$ in $\mathbb{Z}_{q-1}$ and $j \in \{1, \ldots, u\}$ and compute

$$x_j = y g^{r_j} \quad \text{and} \quad x_i = g^{r_i} \text{ for } i \neq j.$$

Compute $(w_1, \ldots, w_u) \leftarrow A(x_1, \ldots, x_u)$. Since by construction the $x_i$'s
have been chosen uniformly at random, we know that with probability at least $p$ the weights
$w_1, \ldots, w_u \in \mathbb{Z}_{q-1}$ are computed such that they are not all equal to zero,
$|w_i| \le (\log q)^c$, and

$$1 = \prod_i x_i^{w_i} = y^{w_j} g^{\sum_i r_i w_i}. \tag{8}$$

Since the $u$ inputs are in random order, the probability that $w_j \neq 0$ is at least
$1/u \ge (\log q)^{-c}$.


Suppose that $w_j \neq 0$. Let $d$ be the greatest common divisor of $w_j$ and $q - 1$. Then⁵
$w_j/d$ is invertible in $\mathbb{Z}_{q-1}$. By using the Chinese remainder theorem (assuming
that we know the factorization of $q - 1$), we are able to compute the inverse of $w_j/d$ in
$\mathbb{Z}_{q-1}$ in polynomial time. Denote this inverse by $w'$. From (8) we infer that

$$y^d = g^{-w' \sum_i r_i w_i}.$$


Notice that if $y^d = g^s$ and $y = g^t$, then $g^{dt} = g^s$, that is $dt = s$ modulo $q - 1$.
Recall that $d$ divides $q - 1$. For this reason $d$ must also divide $s$. Let $d' = (q-1)/d$
and $s' = s/d$. Both can be computed in polynomial time as we have shown. Now $y$ can be
expressed as one of the roots

$$y = g^{s' + j d'},$$

where $0 \le j \le d - 1$. Since $d \le |w_j| \le (\log q)^c$, each of the roots can be checked
in polynomial time. This proves the lemma.
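The final step of the proof, enumerating the at most d candidate roots, can be sketched as
follows. This is only an illustration of that one step: it assumes q is prime so that
arithmetic in GF(q) is plain modular arithmetic, the function name is hypothetical, and the
toy values q = 101, g = 2 are chosen for readability rather than security.

```python
def discrete_log_from_power(q, g, y, d, s):
    """Given y**d == g**s in GF(q) (q prime) with d | q-1, return t with g**t == y."""
    order = q - 1
    s %= order
    if order % d != 0 or s % d != 0:
        raise ValueError("d must divide both q-1 and s")
    d_prime, s_prime = order // d, s // d
    for j in range(d):                          # candidates s' + j*d', 0 <= j <= d-1
        t = (s_prime + j * d_prime) % order
        if pow(g, t, q) == y:
            return t
    return None

q, g = 101, 2                                   # 2 generates the multiplicative group mod 101
secret_t = 37
y = pow(g, secret_t, q)
d = 4                                           # a divisor of q - 1 = 100
s = (d * secret_t) % (q - 1)                    # so that y**d == g**s
print(discrete_log_from_power(q, g, y, d, s))
```

Only d exponentiations are needed, which stays polynomial because d is bounded by
$(\log q)^c$ in the lemma.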


The DL assumption states that for all ppt algorithms $A(\cdot)$, any number $c$, and $Q$ large
enough,

$$\mathrm{Prob}\left\{ q \ge Q \text{ is a prime power},\ g \text{ generates } GF(q),\ y \leftarrow GF(q),\ x \leftarrow A(y) :\ y = g^x \right\} < (\log q)^{-c}.$$


We are ready to prove Theorem 2. 


5 Division / denotes division over integers, not over $\mathbb{Z}_{q-1}$ (since $d$ has no
inverse in $\mathbb{Z}_{q-1}$, we cannot divide $w_j$ by $d$ in $\mathbb{Z}_{q-1}$).

Proof. Suppose that there exists a number $c$ and a probabilistic polynomial time algorithm
$B(H)$, which runs in time $u = (\log q)^c$, with access to a random oracle $H$, which outputs
with probability $p \ge 1/u$ a collision $M$ for $M'$. That is, $M \neq M'$, $M$ and $M'$ are
polynomial sized $< u$, and

$$\mathcal{H}(M) = \prod_{b \in B} H(b)^{M_b} = \prod_{b \in B} H(b)^{M'_b} = \mathcal{H}(M').$$

This means that

$$1 = \prod_{b \in B} H(b)^{M_b - M'_b},$$

there is a polynomial number of $M_b$'s and $M'_b$'s unequal to zero, for all $b \in B$ the
absolute value $|M_b - M'_b| < u$ is polynomial sized, and there exists a $b \in B$ such that
$M_b - M'_b \neq 0$.

Let $C$ be an algorithm that goes from $GF(q)^u$ to $B \to GF(q)$, where $B \to GF(q)$
denotes the set of oracles with inputs in $B$ and outputs in $GF(q)$. $C$ is chosen such that
$C(x_1, \ldots, x_u)$ returns $x_1$ when it is called for the first time on some input $y_1$,
$x_2$ when it is called for the first time on some input $y_2$ different from $y_1$, and so on.
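The oracle simulator $C$ described above can be sketched as a small lazily filled table. The
class name is hypothetical, and the sketch assumes that a repeated query returns the answer
given the first time, as any function from $B$ to $GF(q)$ must.

```python
class ListBackedOracle:
    """Answers the k-th distinct query with x_k and repeats answers for repeated queries."""

    def __init__(self, xs):
        self._unused = list(xs)     # x_1, ..., x_u in order
        self._answers = {}          # query -> value already handed out

    def __call__(self, query):
        if query not in self._answers:
            if not self._unused:
                raise RuntimeError("more distinct queries than supplied values")
            self._answers[query] = self._unused.pop(0)
        return self._answers[query]

C = ListBackedOracle([7, 3, 9])
print(C("b1"), C("b2"), C("b1"), C("b3"))
```

An algorithm making at most u distinct queries never exhausts the list, so from its point of
view the answers are as random as the supplied values.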

When $x_1, \ldots, x_u$ are chosen randomly, $C(x_1, \ldots, x_u)$ cannot be distinguished
from a random oracle by $B$ because $B$ cannot query $C$ more than $u$ times. Therefore, if we
let $A$ be the composition of $B$ and $C$, $A$ is able to find a collision for $\mathcal{H}$
with probability $p$ when its inputs are chosen uniformly at random. Moreover, $A$ is a ppt
algorithm satisfying (7), so by Lemma 7, $A$ can break the discrete log problem in $GF(q)$ in
polynomial time with probability at least $p/(\log q)^c \ge (\log q)^{-2c}$. This contradicts
the DL assumption. So $B$ does not exist, which proves multiset-collision resistance.

Because oracle access to $H$ is stronger than oracle access to $\mathcal{H}$, this proves
Theorem 2 when $H$ is a random oracle. The result carries over to poly-random functions
because they are indistinguishable from random functions by ppt algorithms.


Remark. Supposing that $H$ is a random oracle is a strong assumption. Compared to
the MSet-XOR-Hash and MSet-Add-Hash we do not need a secret key (as the seed of a
pseudorandom family of hash functions) at all. We refer to [BR93] for a discussion of the
extent to which the random oracle assumption can be met in practice.


D. Proof of Collision Resistance of Vector Additive Multiset Hash


If $r$ is a fixed constant in the MSet-Add-Hash, then we are again vulnerable to the attack
described for the MSet-XOR-Hash, where $r$ is a fixed constant. The main difference is
that the attack is not modulo $n = 2$ but modulo $n = 2^m$. This means that the linear
combination may lead to a collision with large multiplicities. This would give a
non-polynomial sized collision and does not defeat the multiset collision resistance. It turns
out that this problem is related to a weighted knapsack problem (see also [BM97]). In
this sense MSet-Add-Hash remains multiset collision resistant, even if the pseudorandom
family of hash functions $H_K$ is replaced by a single random function, avoiding the use of
a secret key as in MSet-Mu-Hash.

The weighted knapsack (WK) assumption is defined as follows. For all ppt algorithms
$A(\cdot)$, any number $c$, $q$ large enough, and $u \le (\log q)^c$,

$$\mathrm{Prob}\left\{ (x_i \leftarrow \mathbb{Z}_q)_{i=1}^{u},\ (w_i \in \mathbb{Z}_q)_{i=1}^{u} \leftarrow A(x_1, \ldots, x_u) :\ 0 = \sum_i w_i x_i \bmod q,\ \exists_i\, w_i \neq 0,\ \forall_i\, |w_i| \le (\log q)^c \right\} < (\log q)^{-c}.$$


Notice the resemblance with (7), where multiplication in $GF(q)$ is now replaced by addition
modulo $q$ (where $q$ can be any integer and does not need to be a prime power). It remains
unclear to what extent Ajtai's work [Ajt96] relates this problem to the worst-case shortest
vector problem. It is an open problem whether to believe in the WK assumption.

Let $H : B \to \mathbb{Z}_q$ be a poly-random function. We define

$$\mathcal{H}(M) = \sum_{b \in B} M_b H(b) \bmod q, \tag{9}$$

$\equiv_{\mathcal{H}}$ to be equal to $=$, and $+_{\mathcal{H}}$ to be addition modulo $q$ ($q$
plays the role of $2^m$ in MSet-Add-Hash). The proof of the next theorem is similar to the
proof of Theorem 2 in Appendix C.


Theorem 6. Under the WK assumption, $(\mathcal{H}, +_{\mathcal{H}}, \equiv_{\mathcal{H}})$ as
defined in (9) is multiset-collision resistant.
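The keyless additive hash of (9) can be sketched in a few lines. This is only an
illustration: SHA-256 is an assumed stand-in for the poly-random function $H$, the modulus
is an arbitrary example value with no security claim attached, and the function names are
hypothetical.

```python
import hashlib
from collections import Counter

Q = (1 << 127) - 1      # hypothetical modulus q; the text allows any integer

def H(b: bytes) -> int:
    # SHA-256 is only a stand-in for the poly-random function H : B -> Z_q.
    return int.from_bytes(hashlib.sha256(b).digest(), "big") % Q

def mset_hash(multiset: Counter) -> int:
    """Sum of M_b * H(b) mod q over the elements b of the multiset, as in (9)."""
    return sum(mult * H(b) for b, mult in multiset.items()) % Q

def combine(h1: int, h2: int) -> int:
    """The + operation on hashes: addition modulo q."""
    return (h1 + h2) % Q

M1 = Counter({b"a": 2})
M2 = Counter({b"a": 1, b"b": 3})
print(combine(mset_hash(M1), mset_hash(M2)) == mset_hash(M1 + M2))
```

Since no key or nonce is involved, two hashes are compared with plain equality, and the hash
of a multiset union is the sum of the hashes.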


For completeness, we introduce a multiset hash corresponding to parameters $n = 2^{\sqrt{m}}$
and $l = \sqrt{m}$ (see Section 4). Let $H : B \to \mathbb{Z}_n^l$ be a poly-random function.
Now, we define

$$\mathcal{H}(M) = \sum_{b \in B} M_b H(b) \bmod n,$$

$\equiv_{\mathcal{H}}$ to be equal to $=$, and $+_{\mathcal{H}}$ to be vector addition modulo
$n$. Theorem 6 holds again if we modify the WK assumption by replacing
$x_i \leftarrow \mathbb{Z}_q$ by $x_i \leftarrow \mathbb{Z}_n^l$, $w_i \in \mathbb{Z}_q$ by
$w_i \in \mathbb{Z}_n$, and $q$ by $n$. The main difference is that the $x_i$'s are vectors of
length $l = \sqrt{m}$. According to [GGH96, Sections 2.1 and 2.2]⁶, if there is a ppt
algorithm solving the modified WK problem (that is, it contradicts the modified WK
assumption) then, by Ajtai's theorem [Ajt96], there is a probabilistic polynomial (in $l$)
algorithm which, for any lattice $\mathcal{L}$ in $\mathbb{R}^l$, given an arbitrary basis of
$\mathcal{L}$, approximates (up to a polynomial factor in $l$) the length of the shortest
vector in $\mathcal{L}$. This proves Theorem 3. The worst-case shortest vector problem is
believed to be hard; see [GGH96] for more discussion.


6 Notice that the matrix with columns $x_i$ is in $\mathbb{Z}_n^{l \times u}$ and that the
vector with entries $w_i$ is unequal to zero and has Euclidean norm polynomial in
$l = \sqrt{m}$.


E. Proof of Improved Offline Checker


In this appendix, we prove Theorem 4.


Proof. Suppose the RAM does not behave like valid RAM (i.e. the data value that the
checker reads from a particular address is not the same data value that the checker most
recently wrote to that address). Let $W$ be the multiset of triples written to RAM and let
$R$ be the multiset of triples read from RAM. We will prove that $W \neq R$.

Consider the put and get operations that occur on an address as occurring on a
timeline. To avoid confusion with the values of TIMER, we express this timeline in terms
of processor cycles. Let $x_1$ be the cycle of the first incorrect get operation. Suppose the
checker reads the pair $(v_1, t_1)$ from address $a$ at $x_1$. If there does not exist a cycle
at which the checker writes the pair $(v_1, t_1)$ to address $a$, then $W \neq R$ and we are
done.

Suppose there is a cycle $x_2$ when $(v_1, t_1)$ is first written to address $a$. Because of
line 3 in the get operation, the values of time stamps of all of the writes to $a$ after
$x_1$ are strictly greater than $t_1$. Because the time stamps at $x_1$ and $x_2$ are the same,
and since put operations and get operations do not occur on the same cycle, $x_2$ occurs
before $x_1$ ($x_2 < x_1$). Let $x_3$ be the cycle of the first read from $a$ after $x_2$.
Notice that $x_1$ is a read after $x_2$, so $x_1 \ge x_3$. If $x_1$ were equal to $x_3$, then
the data value most recently written to $a$, i.e. $v_1$, would be read at $x_1$. This
contradicts the assumption that $x_1$ is an incorrect read. Therefore, $x_1 > x_3$.

Because the read at cycle $x_1$ is the first incorrect read, the read at cycle $x_3$ is a
correct read. So the read at $x_3$ reads the same pair that was written at $x_2$. Again,
because of line 3 in the get operation, the values of time stamps of all the writes to $a$
after $x_3$ are strictly greater than $t_1$. Therefore, $(v_1, t_1)$ cannot be written after
$x_3$. Because $x_2$ is the first cycle on which $(v_1, t_1)$ is written to $a$, $(v_1, t_1)$
cannot be written before $x_2$. Because $x_3$ is the first read from $a$ after $x_2$, and two
writes to an address always have a read from that address between them, $(v_1, t_1)$ cannot
be written between $x_2$ and $x_3$. Therefore, the pair $(v_1, t_1)$ is written only once, but
it is read at $x_1$ and $x_3$. Therefore, $W \neq R$.
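The argument can be exercised on a toy simulation. The put and get operations are not
reproduced in this appendix, so the sketch below assumes a plausible form for them: every
operation reads the current pair, advances TIMER past the time stamp it read (the assumed
analogue of "line 3"), and writes a pair back. The multisets W and R are kept explicitly
instead of being hashed, and all names are hypothetical.

```python
from collections import Counter

class OfflineChecker:
    """Toy timestamped checker; W and R are explicit multisets, not multiset hashes."""

    def __init__(self, ram, addresses):
        self.ram, self.addresses, self.timer = ram, list(addresses), 0
        self.W, self.R = Counter(), Counter()
        for a in self.addresses:
            self._write(a, 0, self.timer)       # initialise every address

    def _write(self, a, v, t):
        self.ram[a] = (v, t)
        self.W[(a, v, t)] += 1

    def _read(self, a):
        v, t = self.ram[a]
        self.R[(a, v, t)] += 1
        # Assumed "line 3": later writes get a strictly larger time stamp.
        self.timer = max(self.timer, t + 1)
        return v

    def get(self, a):
        v = self._read(a)
        self._write(a, v, self.timer)           # write back so reads and writes alternate
        return v

    def put(self, a, v):
        self._read(a)
        self._write(a, v, self.timer)

    def check(self):
        for a in self.addresses:                # final read of every address
            v, t = self.ram[a]
            self.R[(a, v, t)] += 1
        return self.W == self.R

ram = {}
checker = OfflineChecker(ram, [0])
checker.put(0, 5)
stale = ram[0]                                  # adversary records an old pair
checker.put(0, 7)
ram[0] = stale                                  # and replays it
print(checker.get(0), checker.check())
```

In the replayed run the stale pair is read twice but was written only once, so the two
multisets differ, which mirrors the conclusion of the proof.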